Knowledge-Free Induction of Morphology Using Latent Semantic Analysis
Abstract
Morphology induction is a subproblem of important tasks like automatic learning of machine-readable dictionaries and grammar induction. Previous morphology induction approaches have relied solely on statistics of hypothesized stems and affixes to choose which affixes to consider legitimate. Relying on stem-and-affix statistics rather than semantic knowledge leads to a number of problems, such as the inappropriate use of valid affixes ("ally" stemming to "all"). We introduce a semantic-based algorithm for learning morphology which only proposes affixes when the stem and stem-plus-affix are sufficiently similar semantically. We implement our approach using Latent Semantic Analysis and show that our semantics-only approach provides morphology induction results that rival a current state-of-the-art system.

1 Introduction

Computational morphological analyzers have existed in various languages for years and it has been said that "the quest for an efficient method for the analysis and generation of word-forms is no longer an academic research topic" (Karlsson and Karttunen, 1997). However, development of these analyzers typically begins with human intervention requiring time spans from days to weeks. If it were possible to build such analyzers automatically without human knowledge, significant development time could be saved.

On a larger scale, consider the task of inducing machine-readable dictionaries (MRDs) using no human-provided information ("knowledge-free"). In building an MRD, "simply expanding the dictionary to encompass every word one is ever likely to encounter...fails to take advantage of regularities" (Sproat, 1992, p. xiii). Hence, automatic morphological analysis is also critical for selecting appropriate and non-redundant MRD headwords.

For the reasons expressed above, we are interested in knowledge-free morphology induction. Thus, in this paper, we show how to automatically induce morphological relationships between words.
Previous morphology induction approaches (Goldsmith, 1997, 2000; Déjean, 1998; Gaussier, 1999) have focused on inflectional languages and have used statistics of hypothesized stems and affixes to choose which affixes to consider legitimate. Several problems can arise using only stem-and-affix statistics: (1) valid affixes may be applied inappropriately ("ally" stemming to "all"), (2) morphological ambiguity may arise ("rating" conflating with "rat" instead of "rate"), and (3) non-productive affixes may get accidentally pruned (the relationship between "dirty" and "dirt" may be lost).¹

Some of these problems could be resolved if one could incorporate word semantics. For instance, "all" is not semantically similar to "ally," so with knowledge of semantics, an algorithm could avoid conflating these two words. To maintain the "knowledge-free" paradigm, such semantics would need to be automatically induced. Latent Semantic Analysis (LSA) (Deerwester et al., 1990; Landauer et al., 1998) is a technique which automatically identifies semantic information from a corpus. We here show that incorporating LSA-based semantics alone into the morphology-induction process can provide results that rival a state-of-the-art system based on stem-and-affix statistics (Goldsmith's Linguistica).

¹ Error examples are from Goldsmith's Linguistica.
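The idea of accepting a stem/word pair only when the two are semantically close can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the word-by-sentence count matrix, the number of retained dimensions `k`, the threshold of 0.8, and the function names `lsa_vectors`, `cosine`, and `is_semantic_pair` are all assumptions made for illustration. It uses a plain truncated SVD from NumPy as a stand-in for a full LSA pipeline.

```python
import numpy as np


def lsa_vectors(sentences, k=2):
    """Map each word to a k-dimensional vector via truncated SVD
    of a word-by-sentence count matrix (a toy stand-in for LSA)."""
    vocab = sorted({w for s in sentences for w in s.split()})
    row = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(sentences)))
    for j, sentence in enumerate(sentences):
        for w in sentence.split():
            counts[row[w], j] += 1.0
    # Singular values come back sorted in descending order.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    reduced = u[:, :k] * s[:k]  # keep only the k strongest dimensions
    return {w: reduced[row[w]] for w in vocab}


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def is_semantic_pair(stem, word, vectors, threshold=0.8):
    """Accept (stem, word) only if their reduced vectors are close.
    The threshold value is an illustrative assumption."""
    return cosine(vectors[stem], vectors[word]) >= threshold


# Hypothetical toy corpus: "rate" and "rating" share contexts, "rat" does not.
corpus = [
    "critics rate movie",
    "critics rating movie",
    "rat eats cheese",
    "cat chases rat",
]
vectors = lsa_vectors(corpus, k=2)
print("rate ~ rating:", is_semantic_pair("rate", "rating", vectors))
print("rat  ~ rating:", is_semantic_pair("rat", "rating", vectors))
```

Note that "rate" and "rating" never occur in the same sentence here; they become similar only after dimensionality reduction, because they share the context words "critics" and "movie". A purely string-based stemmer would have no reason to prefer "rate" over "rat" as the stem of "rating".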
Similar resources
Is Knowledge-Free Induction of Multiword Unit Dictionary Headwords a Solved Problem?
We seek a knowledge-free method for inducing multiword units from text corpora for use as machine-readable dictionary headwords. We provide two major evaluations of nine existing collocation-finders and illustrate the continuing need for improvement. We use Latent Semantic Analysis to make modest gains in performance, but we show the significant challenges encountered in trying this approach.
Query expansion based on relevance feedback and latent semantic analysis
Web search engines are among the most popular tools on the Internet and are widely used by expert and novice users alike. Constructing an adequate query that best specifies the user's information need to the search engine is an important concern of web users. Query expansion is a way to reduce this concern and increase user satisfaction. In this paper, a new method of query expa...
Latent Semantic Grammar Induction: Context, Projectivity, and Prior Distributions
This paper presents latent semantic grammars for the unsupervised induction of English grammar. Latent semantic grammars were induced by applying singular value decomposition to n-gram by context-feature matrices. Parsing was used to evaluate performance. Experiments with context, projectivity, and prior distributions show the relative performance effects of these kinds of prior knowledge. Resu...
Bootstrapping Language Description: the case of Mpiemo (Bantu A, Central African Republic)
Linguists have long been producing grammatical descriptions of yet undescribed languages. This is a time-consuming process, which has already adapted to improved technology for recording and storage. We present here a novel application of NLP techniques to bootstrap analysis of collected data and speed up manual selection work. To be more precise, we argue that unsupervised induction of morpholo...
Statement for Irina Matveeva
My research interest is to improve natural language applications by developing efficient unsupervised and semi-supervised machine learning approaches. My approach is to design machine learning solutions tailored to specific natural language problems based on an in-depth analysis of their components. I believe that machine learning algorithms are most efficient for language applications if they ...
Publication date: 2000